In this notebook, we are interested in exploring the use of miQC as an additional approach to filtering cells to remove any remaining cells that may not be viable or have low sequencing information.
Here, I am particularly comparing the use of miQC to using hard thresholds, in particular removing cells with greater than 20% mitochondrial content, less than 500 genes detected per cell, and less than 1000 total counts per cell.
For the most part, I followed the instructions in the miQC vignette for plotting and filtering.
I’m choosing to compare 2 single-cell samples (SCPCR000126, SCPCR000127) and 2 single-nuclei samples (SCPCR000118, SCPCR000119). We are particularly interested in how miQC will handle single-nuclei samples, as they should not contain many reads corresponding to mitochondrial genes.
library(magrittr)
library(ggplot2)
library(SingleCellExperiment)
library(scpcaTools)
library(gridExtra)
Attaching package: ‘gridExtra’
The following object is masked from ‘package:Biobase’:
combine
The following object is masked from ‘package:BiocGenerics’:
combine
# file paths to emptyDrops filtered sces from Alevin-fry
base_dir <- here::here()
cr_like_results_dir <- file.path(base_dir, "data", "results")
cr_like_sce_file <- file.path(cr_like_results_dir, "alevin-fry-cr-like-em-emptydrops-200-sces.rds")
cr_like_sce <- readr::read_rds(cr_like_sce_file)
cannot open file '/Users/allyhawkins/Documents/ALSF/git_repos/scpca-nf/data/results/alevin-fry-cr-like-em-emptydrops-200-sces.rds': No such file or directoryError in readRDS(con, refhook = refhook) : cannot open the connection
# read in mito gene list
sample_info_dir <- file.path(base_dir, "sample-info")
mito_file <- file.path(sample_info_dir, "Homo_sapiens.GRCh38.103.mitogenes.txt")
mito_genes <- readr::read_tsv(mito_file, col_names = "gene_id")
Rows: 37 Columns: 1
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): gene_id
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mito_genes <- mito_genes %>%
dplyr::pull(gene_id) %>%
unique()
Before we can do any modeling and perform any filtering, let’s look at the distribution of mitochondrial fraction per cell and unique genes detected per cell. The assumptions of miQC rely on having a distribution of mitochondrial reads and unique genes found. It uses a joint model to look at the proportion of reads mapping to mitochondrial DNA and the number of detected genes and determines the probability of a cell being compromised based on this proportion.
To start we have to calculate the per cell statistics using addPerCellQC().
# add per cell qc
cr_like_sce <- cr_like_sce %>%
purrr::map(scater::addPerCellQC, subsets = list(mito = mito_genes))
plot_miQC_metrics <- function(sce, sample_name){
p <- miQC::plotMetrics(sce) +
ggtitle(sample_name)
print(p)
}
purrr::iwalk(cr_like_sce, plot_miQC_metrics)
Already you can see that the plot shows a distribution of mitochondrial reads from 0-100% for the two single cell samples, but not for the single-nuclei samples. This could affect the ability of the model to distinguish compromised cells.
Now we can actually look at the miQC model and what cells will be considered compromised and filtered out. Here, I’m plotting the probabilities that each cell is compromised as calculated by miQC, which cells would be filtered out using a given cutoff, and then comparing it to if we were to filter our cells using a pre determind threshold.
# function to test miQC modeling with different parameters on sce objects and show plots
model_sce <- function(sce, title, model_type = "linear", posterior_cutoff = 0.75){
# plot distributions of total counts/cell, genes detected/cell and mito content
coldata_df <- data.frame(colData(sce))
manual_filter_plot <- ggplot(coldata_df, aes(x = sum, y = detected, color = subsets_mito_percent)) +
geom_point(alpha = 0.5) +
scale_color_viridis_c() +
labs(x = "Total Count",
y = "Number of Genes Expressed",
color = "Mitochondrial\nFraction") +
geom_vline(xintercept = 500) +
geom_hline(yintercept = 1000) +
ggtitle(title) +
theme_classic()
# create model
sce_model <- miQC::mixtureModel(sce, model_type)
# plot cells colored by probability of cell being compromised or not
model_plot <- miQC::plotModel(sce, sce_model) +
theme_classic()
# plot cells that would be filtered using model vs cells that would be filtered using manual QC cutoffs
miQC_filter_plot <- miQC::plotFiltering(sce, sce_model, posterior_cutoff) +
geom_hline(yintercept = 20) +
geom_vline(xintercept = 500) +
theme_classic()
# arrange plots
grid.arrange(manual_filter_plot, model_plot, miQC_filter_plot, nrow = 2)
}
purrr::iwalk(cr_like_sce, model_sce)
Using the suggested parameters from miQC we see that it does a fairly good job of keeping cells with low mito content and high number of genes detected for the single-cell samples, except for a few cases in 126 and 127 where cells with low mito and high unique genes are being thrown out. For the single-nuclei samples however, it looks like the model is throwing away a lot more cells than we may want to probably due to the low range in mitochondrial percentages to begin with. Let’s see if we alter some of the parameters used, like the cutoff for filtering, or the model used to compute the probability of compromised cells, and if that impacts the output.
miQC will remove any cells that have a high probability of being compromised, by default these are cells with higher than 0.75 posterior probability. We can increase that threshold, which means that more cells will be kept and see if we are able to recover some of the cells that were being thrown out previously that we think should have been included (ie. cells with low mito that are being thrown out).
# change the posterior cutoff for miQC to keep throw away cells with > 0.9 probability of being compromised
purrr::iwalk(cr_like_sce, model_sce, posterior_cutoff = 0.9)
It doesn’t appear that increasing the probability threshold rescues very many more cells with low mito content in SCPCR000126 and SCPCR000127 that are being filtered out.
In addition to altering the threshold used for filtering, we can also test the use of different mixture models to calculate the posterior probability of a cell being compromised. We will test use of both the spline and polynomial modes rather than the default linear model. The polynomial model uses a single polynomial to model the entire datset, while spline uses a piecewise continuous function of many polynomials to model the data set.
# test spline mixture model rather than linear
purrr::iwalk(cr_like_sce, model_sce, model_type = "spline")
Use of the spline model definitely doesn’t seem ideal with one sample, SCPCR000126, showing the reverse of what we would expect and only keeping cells with low mito content. For the other samples we see a similar trend that we had observed in the linear model.
# test polynomial mixture model rather than linear
purrr::iwalk(cr_like_sce, model_sce, model_type = "polynomial")
For the linear model, we still see that cells with low mito are getting excluded from SCPCR000118 and SCPCR000119, but now see that same trend in SCPCR000220 and SCPCR000221 as well. It appears to model the cells with very high number of unique genes and low mito content as cells to exclude when we would rather keep those.
Now we can compare filtering using miQC vs. using the pre-determined cutoffs to see how the different filtering effects the overall number of cells that are removed. We can also look at the distribution of each of the metrics used for filtering, genes detected/cell, and mito content/cell to see how filtering affects the population of cells. Here we are using a threshold of 500 genes detected per cell and 10% mitochondrial reads.
# function to filter sce using miQC
miQC_filter <- function(sce, model_type = "linear", posterior_cutoff = 0.75){
sce_model <- miQC::mixtureModel(sce, model_type)
filtered_sce <- miQC::filterCells(sce, sce_model)
}
# filter using manual
manual_filter <- function(sce,
mito_cutoff = 20,
genes_detected_cutoff = 500) {
# get vector of cells to keep
cells_keep <- sce$detected > genes_detected_cutoff &
sce$subsets_mito_percent < mito_cutoff
# filter sce
filtered_sce <- sce[,cells_keep]
}
manual_filtered_sce <- purrr::map(cr_like_sce, manual_filter)
miQC_filtered_sce <- purrr::map(cr_like_sce, miQC_filter)
Removing 618 out of 1612 cells.Removing 1792 out of 6520 cells.Removing 959 out of 15870 cells.Removing 1184 out of 15673 cells.Removing 1537 out of 5032 cells.Removing 1145 out of 5306 cells.
manual_filtered_sce_10 <- purrr::map(cr_like_sce, manual_filter, mito_cutoff = 10)
manual_filtered_sce_mito_only <- purrr::map(cr_like_sce, manual_filter, genes_detected_cutoff = 0)
# function to grab number of cells from sce
get_num_cells <- function(sce){
num_cells <- dim(sce)[2]
}
manual_cell_num <- manual_filtered_sce %>%
purrr::map(get_num_cells) %>%
as.data.frame()
rownames(manual_cell_num) <- c("manual_mito_20")
manual_cell_num_10 <- manual_filtered_sce_10 %>%
purrr::map(get_num_cells) %>%
as.data.frame()
rownames(manual_cell_num_10) <- c("manual_mito_10")
manual_cell_num_mito_only <- manual_filtered_sce_mito_only %>%
purrr::map(get_num_cells) %>%
as.data.frame()
rownames(manual_cell_num_mito_only) <- c("manual_mito_only")
miQC_cell_num <- miQC_filtered_sce %>%
purrr::map(get_num_cells) %>%
as.data.frame()
rownames(miQC_cell_num) <- c("miQC")
pre_filter_cell_num <- cr_like_sce %>%
purrr::map(get_num_cells) %>%
as.data.frame()
rownames(pre_filter_cell_num) <- c("pre_filtering")
# create dataframe for plotting with number of cells per sce
combined_cell_num_df <- dplyr::bind_rows(manual_cell_num,
miQC_cell_num,
manual_cell_num_10,
manual_cell_num_mito_only,
pre_filter_cell_num) %>%
tibble::rownames_to_column("filtering_method") %>%
tidyr::pivot_longer(cols = starts_with("SCPCR"),
names_to = "sample",
values_to = "number_cells")
It looks like manual filtering (with these specific cutoff combinations) are removing more cells than miQC for single-cell, but more cells are removed in single-nuclei samples with either miQC than manual.
Suggested next steps include comparing PCA or UMAP results using the different filtering thresholds to identify if any clusters corresponding to “bad cells” remain. Additionally, we will need to identify if we want to add any additional criteria besides the posterior probability computed by miQC to be included as a “keep” cell in the ccdl_suggests column that will be appended to the colData for every filtered sce object.
sessioninfo::session_info()
─ Session info ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
─ Packages ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
package * version date lib source
AnnotationDbi 1.52.0 2020-10-27 [1] Bioconductor
askpass 1.1 2019-01-13 [1] CRAN (R 4.0.2)
assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2)
beachmat 2.6.4 2020-12-20 [1] Bioconductor
beeswarm 0.4.0 2021-06-01 [1] CRAN (R 4.0.2)
Biobase * 2.50.0 2020-10-27 [1] Bioconductor
BiocFileCache 1.14.0 2020-10-27 [1] Bioconductor
BiocGenerics * 0.36.1 2021-04-16 [1] Bioconductor
BiocManager 1.30.16 2021-06-15 [1] CRAN (R 4.0.2)
BiocNeighbors 1.8.2 2020-12-07 [1] Bioconductor
BiocParallel 1.24.1 2020-11-06 [1] Bioconductor
BiocSingular 1.6.0 2020-10-27 [1] Bioconductor
biomaRt 2.46.3 2021-02-11 [1] Bioconductor
Biostrings 2.58.0 2020-10-27 [1] Bioconductor
bit 4.0.4 2020-08-04 [1] CRAN (R 4.0.2)
bit64 4.0.5 2020-08-30 [1] CRAN (R 4.0.2)
bitops 1.0-7 2021-04-24 [1] CRAN (R 4.0.2)
blob 1.2.2 2021-07-23 [1] CRAN (R 4.0.2)
bslib 0.2.5.1 2021-05-18 [1] CRAN (R 4.0.2)
cachem 1.0.6 2021-08-19 [1] CRAN (R 4.0.2)
cli 3.0.1 2021-07-17 [1] CRAN (R 4.0.2)
colorspace 2.0-2 2021-06-24 [1] CRAN (R 4.0.2)
crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.2)
curl 4.3.2 2021-06-23 [1] CRAN (R 4.0.2)
DBI 1.1.1 2021-01-15 [1] CRAN (R 4.0.2)
dbplyr 2.1.1 2021-04-06 [1] CRAN (R 4.0.2)
DelayedArray 0.16.3 2021-03-24 [1] Bioconductor
DelayedMatrixStats 1.12.3 2021-02-03 [1] Bioconductor
digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.2)
dplyr 1.0.7 2021-06-18 [1] CRAN (R 4.0.2)
dqrng 0.3.0 2021-05-01 [1] CRAN (R 4.0.2)
DropletUtils 1.13.2 2021-08-30 [1] Bioconductor
edgeR 3.32.1 2021-01-14 [1] Bioconductor
eisaR 1.2.0 2020-10-27 [1] Bioconductor
ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.0.2)
evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.1)
fansi 0.5.0 2021-05-25 [1] CRAN (R 4.0.2)
farver 2.1.0 2021-02-28 [1] CRAN (R 4.0.2)
fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.0.2)
flexmix 2.3-17 2020-10-12 [1] CRAN (R 4.0.2)
generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.2)
GenomeInfoDb * 1.26.7 2021-04-08 [1] Bioconductor
GenomeInfoDbData 1.2.4 2021-04-09 [1] Bioconductor
GenomicAlignments 1.26.0 2020-10-27 [1] Bioconductor
GenomicFeatures 1.42.3 2021-04-04 [1] Bioconductor
GenomicRanges * 1.42.0 2020-10-27 [1] Bioconductor
getopt 1.20.3 2019-03-22 [1] CRAN (R 4.0.2)
ggbeeswarm 0.6.0 2017-08-07 [1] CRAN (R 4.0.2)
ggplot2 * 3.3.5 2021-06-25 [1] CRAN (R 4.0.2)
glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
gridExtra * 2.3 2017-09-09 [1] CRAN (R 4.0.2)
grr 0.9.5 2016-08-26 [1] CRAN (R 4.0.2)
gtable 0.3.0 2019-03-25 [1] CRAN (R 4.0.2)
HDF5Array 1.18.1 2021-02-04 [1] Bioconductor
here 1.0.1 2020-12-13 [1] CRAN (R 4.0.2)
hms 1.1.0 2021-05-17 [1] CRAN (R 4.0.2)
htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.0.5)
httr 1.4.2 2020-07-20 [1] CRAN (R 4.0.2)
IRanges * 2.24.1 2020-12-12 [1] Bioconductor
irlba 2.3.3 2019-02-05 [1] CRAN (R 4.0.2)
jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.0.2)
jsonlite 1.7.2 2020-12-09 [1] CRAN (R 4.0.2)
knitr 1.33 2021-04-24 [1] CRAN (R 4.0.2)
labeling 0.4.2 2020-10-20 [1] CRAN (R 4.0.2)
lattice 0.20-44 2021-05-02 [1] CRAN (R 4.0.2)
lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.2)
limma 3.46.0 2020-10-27 [1] Bioconductor
locfit 1.5-9.4 2020-03-25 [1] CRAN (R 4.0.2)
magrittr * 2.0.1 2020-11-17 [1] CRAN (R 4.0.2)
Matrix 1.3-4 2021-06-01 [1] CRAN (R 4.0.2)
Matrix.utils 0.9.8 2020-02-26 [1] CRAN (R 4.0.2)
MatrixGenerics * 1.2.1 2021-01-30 [1] Bioconductor
matrixStats * 0.60.1 2021-08-23 [1] CRAN (R 4.0.2)
memoise 2.0.0 2021-01-26 [1] CRAN (R 4.0.2)
miQC 0.99.9 2021-04-19 [1] Github (greenelab/miQC@1c260dc)
modeltools 0.2-23 2020-03-05 [1] CRAN (R 4.0.2)
munsell 0.5.0 2018-06-12 [1] CRAN (R 4.0.2)
nnet 7.3-16 2021-05-03 [1] CRAN (R 4.0.2)
openssl 1.4.4 2021-04-30 [1] CRAN (R 4.0.2)
optparse 1.6.6 2020-04-16 [1] CRAN (R 4.0.2)
pillar 1.6.2 2021-07-29 [1] CRAN (R 4.0.2)
pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2)
prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.2)
progress 1.2.2 2019-05-16 [1] CRAN (R 4.0.2)
purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.2)
R.methodsS3 1.8.1 2020-08-26 [1] CRAN (R 4.0.2)
R.oo 1.24.0 2020-08-26 [1] CRAN (R 4.0.2)
R.utils 2.10.1 2020-08-26 [1] CRAN (R 4.0.2)
R6 2.5.1 2021-08-19 [1] CRAN (R 4.0.2)
rappdirs 0.3.3 2021-01-31 [1] CRAN (R 4.0.2)
Rcpp 1.0.7 2021-07-07 [1] CRAN (R 4.0.2)
RCurl 1.98-1.4 2021-08-17 [1] CRAN (R 4.0.2)
readr 2.0.1 2021-08-10 [1] CRAN (R 4.0.2)
rhdf5 2.34.0 2020-10-27 [1] Bioconductor
rhdf5filters 1.2.1 2021-05-03 [1] Bioconductor
Rhdf5lib 1.12.1 2021-01-26 [1] Bioconductor
rlang 0.4.11 2021-04-30 [1] CRAN (R 4.0.2)
rmarkdown 2.10 2021-08-06 [1] CRAN (R 4.0.2)
rprojroot 2.0.2 2020-11-15 [1] CRAN (R 4.0.2)
Rsamtools 2.6.0 2020-10-27 [1] Bioconductor
RSQLite 2.2.8 2021-08-21 [1] CRAN (R 4.0.2)
rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.2)
rsvd 1.0.5 2021-04-16 [1] CRAN (R 4.0.2)
rtracklayer 1.50.0 2020-10-27 [1] Bioconductor
S4Vectors * 0.28.1 2020-12-09 [1] Bioconductor
sass 0.4.0 2021-05-12 [1] CRAN (R 4.0.5)
scales 1.1.1 2020-05-11 [1] CRAN (R 4.0.2)
scater 1.18.6 2021-02-26 [1] Bioconductor
scpcaTools * 0.1.0 2021-09-08 [1] local
scuttle 1.0.4 2020-12-17 [1] Bioconductor
sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2)
SingleCellExperiment * 1.12.0 2020-10-27 [1] Bioconductor
sparseMatrixStats 1.2.1 2021-02-02 [1] Bioconductor
stringi 1.7.4 2021-08-25 [1] CRAN (R 4.0.5)
stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
SummarizedExperiment * 1.20.0 2020-10-27 [1] Bioconductor
tibble 3.1.4 2021-08-25 [1] CRAN (R 4.0.5)
tidyr 1.1.3 2021-03-03 [1] CRAN (R 4.0.2)
tidyselect 1.1.1 2021-04-30 [1] CRAN (R 4.0.2)
tinytex 0.33 2021-08-05 [1] CRAN (R 4.0.2)
tzdb 0.1.2 2021-07-20 [1] CRAN (R 4.0.5)
utf8 1.2.2 2021-07-24 [1] CRAN (R 4.0.2)
vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.0.2)
vipor 0.4.5 2017-03-22 [1] CRAN (R 4.0.2)
viridis 0.6.1 2021-05-11 [1] CRAN (R 4.0.2)
viridisLite 0.4.0 2021-04-13 [1] CRAN (R 4.0.2)
vroom 1.5.4 2021-08-05 [1] CRAN (R 4.0.2)
withr 2.4.2 2021-04-18 [1] CRAN (R 4.0.5)
xfun 0.25 2021-08-06 [1] CRAN (R 4.0.2)
XML 3.99-0.7 2021-08-17 [1] CRAN (R 4.0.2)
xml2 1.3.2 2020-04-23 [1] CRAN (R 4.0.2)
XVector 0.30.0 2020-10-28 [1] Bioconductor
yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2)
zlibbioc 1.36.0 2020-10-28 [1] Bioconductor
[1] /Library/Frameworks/R.framework/Versions/4.0/Resources/library